Appendix D — Joins

If you work with biodiversity data, odds are high that you will need to join 2 separate datasets at some point to test how spatial, temporal or environmental factors influence species. This chapter is a brief overview of several common types of joins dplyr to help get started. For a comprehensive introduction to joins, check out the Joins chapter in R for Data Science.

D.0.1 Prerequisites

In this chapter, we will use pardalote occurrence data from 2015 in the ALA.

# packages
library(galah)
library(dplyr)
library(here)
library(ggplot2)
galah_config(email = "your-email-here") # ALA-registered email

pardalotes <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.3fb2325f-cea0-4114-a9c6-3e28a2c204ee") |>
  atlas_occurrences()

pardalotes_taxonomy <- galah_call() |>
  identify("Pardalotus") |>
  atlas_species()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

pardalotes <- galah_call() |>
  identify("Pardalotus") |>
  filter(year == 2015) |>
  select(genus, 
         species, 
         scientificName, 
         cl22,
         year,
         month, 
         decimalLatitude,
         decimalLongitude) |> 
  atlas_occurrences()
1
We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

D.1 Keys

Joining dataframes relies on setting a key—one or more columns that exist in a primary table that correspond to one or more columns in a second table. Two dataframes that we intend to join are matched according to the designated key.

As a basic example, let’s say we want to add complete taxonomic information to our pardalotes dataframe, which contains occurrence records with some, but not all, taxonomic information. pardalotes_taxonomy contains complete taxonomic information for Pardalotus.

Let’s join our pardalotes dataframe with pardalotes_taxonomy. The column scientificName in pardalotes appears to contain (in general) the same information in column species_name in pardalotes_taxonomy.

We can use these columns as a key to join the additional taxonomic information to our pardalotes occurrence records. We can also use the genus column as a second key to match any observations that have been identified down to genus, but not species, level.

pardalotes |>
  left_join(pardalotes_taxonomy, 
            join_by(scientificName == species_name, 
                    genus == genus)
            ) |>
  
  rmarkdown::paged_table() # paged output

D.2 Basic types of joins

There are many types of joins that can help you in all kinds of situations! Join types generally fall within two categories:

  • Mutating joins combine variables from two tables (e.g., left_join(), right_join(), full_join())
  • Filtering joins combine variables, and additionally keep or remove rows that do not match the key column (e.g., semi_join(), anti_join())

Below are a few common examples of join types. Examples and animations are taken from Garrick Aden-Buie’s tidyexplain animations.

x <- tibble(id = c(1, 2, 3),
            x = c("x1", "x2", "x3"))

y <- tibble(id = c(1, 2, 4),
            y = c("y1", "y2", "y4"))

D.3 Spatial joins

It can be useful to summarise where species or taxonomic groups occur by grouping by a spatial region (e.g., state, council area, bioregion). To do so, records or summary statistics need to be linked to the region they pertain to. Usually, this requires a join between a spatial object and a dataframe, or between two spatial objects.

As an simple example, let’s download a shapefile of Australian states and territories using the ozmaps package. Our aus object contains the name of each state/territory (NAME) and its shape (geometry)1.

library(sf)
library(ozmaps)

aus <- ozmap_states |>
  st_transform(4326)

aus
1
This line transforms the CRS projection of our map to match the CRS projection of ALA data.
Simple feature collection with 9 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS:  WGS 84
# A tibble: 9 × 2
  NAME                                                                  geometry
* <chr>                                                       <MULTIPOLYGON [°]>
1 New South Wales              (((150.7016 -35.12286, 150.6611 -35.11782, 150.6…
2 Victoria                     (((146.6196 -38.70196, 146.6721 -38.70259, 146.6…
3 Queensland                   (((148.8473 -20.3457, 148.8722 -20.37575, 148.85…
4 South Australia              (((137.3481 -34.48242, 137.3749 -34.46885, 137.3…
5 Western Australia            (((126.3868 -14.01168, 126.3625 -13.98264, 126.3…
6 Tasmania                     (((147.8397 -40.29844, 147.8902 -40.30258, 147.8…
7 Northern Territory           (((136.3669 -13.84237, 136.3339 -13.83922, 136.3…
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, 149.271…
9 Other Territories            (((167.9333 -29.05421, 167.9188 -29.0344, 167.93…

Our pardalotes data also contains the state/territory of each occurrence in column cl22. We can group by state/territory and summarise the number of occurrences to get an overall count by state/territory.

counts_by_state <- pardalotes |>
  group_by(cl22) |>
  count()

counts_by_state
# A tibble: 9 × 2
# Groups:   cl22 [9]
  cl22                              n
  <chr>                         <int>
1 Australian Capital Territory  53413
2 New South Wales              114431
3 Northern Territory             9768
4 Queensland                   109084
5 South Australia               24513
6 Tasmania                      24349
7 Victoria                     184457
8 Western Australia             33892
9 <NA>                           2468

To prepare our data for mapping, we can join counts_by_state to aus using the state/territory name as our key.

aus_counts <- aus |>
  left_join(counts_by_state,
            join_by(NAME == cl22))

aus_counts
Simple feature collection with 9 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS:  WGS 84
# A tibble: 9 × 3
  NAME                                                           geometry      n
  <chr>                                                <MULTIPOLYGON [°]>  <int>
1 New South Wales              (((150.7016 -35.12286, 150.6611 -35.11782… 114431
2 Victoria                     (((146.6196 -38.70196, 146.6721 -38.70259… 184457
3 Queensland                   (((148.8473 -20.3457, 148.8722 -20.37575,… 109084
4 South Australia              (((137.3481 -34.48242, 137.3749 -34.46885…  24513
5 Western Australia            (((126.3868 -14.01168, 126.3625 -13.98264…  33892
6 Tasmania                     (((147.8397 -40.29844, 147.8902 -40.30258…  24349
7 Northern Territory           (((136.3669 -13.84237, 136.3339 -13.83922…   9768
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, …  53413
9 Other Territories            (((167.9333 -29.05421, 167.9188 -29.0344,…     NA

Now we can use these data to create a choropleth map2.

ggplot() +
  geom_sf(data = aus_counts,
          aes(fill = n)) + 
  guides(fill = guide_coloursteps(title = "Number of\nObservations")) +
  scale_fill_viridis_c(option = "G") +
  theme_void()

The sf package also has specialised functions for spatial joins like st_join(), which can be especially useful for joins using points (e.g., POINT) and shapes (e.g., POLYGON, MULTIPOLYGON). Below is a small example where we use the point location to join with the state/territory. Note that we lose the POINT location in favour of the state MULTIPOLYGON shape, held in the column geometry.

# convert record coordinates to sf POINT class
pardalotes_sf <- pardalotes |>
  tidyr::drop_na() |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), 
           crs = 4326)

# join points to aus states that intersect spatially
states_with_species <- st_join(x = aus, 
                               y = pardalotes_sf,
                            join = st_intersects,
                            left = FALSE)

states_with_species |>
  rmarkdown::paged_table() # paged output

  1. This shapefile contains state/territory outlines as vectors. See this section on vectors to learn more about what a vector is.↩︎

  2. For more advanced examples of making choropleth maps, check out the ALA Labs articles counting points in shapefiles and using multiple colour scales.↩︎